{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Evaluation of Poincare Embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook evaluates how well Poincare embeddings perform on the tasks described in the [original paper](https://arxiv.org/pdf/1705.08039.pdf).\n",
    "\n",
    "The following two external, open-source implementations are used - \n",
    "1. [C++](https://github.com/TatsuyaShirakawa/poincare-embedding)\n",
    "2. [Numpy](https://github.com/nishnik/poincare_embeddings)\n",
    "\n",
    "This is the list of tasks - \n",
    "1. WordNet reconstruction\n",
    "2. WordNet link prediction\n",
    "3. Link prediction in collaboration networks (evaluation incomplete)\n",
    "4. Lexical entailment on HyperLex\n",
    "\n",
    "A more detailed explanation of the tasks and the evaluation methodology is present in the individual evaluation subsections."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Setup\n",
    "\n",
    "This section performs the following steps - \n",
    "1. Imports the required Python libraries and downloads the WordNet data\n",
    "2. Clones the repositories containing the C++ and Numpy implementations of the Poincare embeddings\n",
    "3. Applies patches containing minor changes to the implementations\n",
    "4. Compiles the C++ sources to create a binary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/home/jayant/Projects/gensim/gensim\n"
     ]
    }
   ],
   "source": [
    "% cd ../.."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Some libraries need to be installed that are not part of Gensim\n",
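    "# The version specifiers are quoted so that the shell does not treat `>` as a redirection\n",
    "! pip install 'click>=6.7' 'nltk>=3.2.5' 'prettytable>=0.7.2' 'pygtrie>=2.2'"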
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package wordnet to /home/jayant/nltk_data...\n",
      "[nltk_data]   Package wordnet is already up-to-date!\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import csv\n",
    "from collections import OrderedDict\n",
    "from IPython.display import display, HTML\n",
    "import logging\n",
    "import os\n",
    "import pickle\n",
    "import random\n",
    "import re\n",
    "\n",
    "import click\n",
    "from gensim.models.poincare import PoincareModel, PoincareRelations, \\\n",
    "    ReconstructionEvaluation, LinkPredictionEvaluation, \\\n",
    "    LexicalEntailmentEvaluation, PoincareKeyedVectors\n",
    "from gensim.utils import check_output\n",
    "import nltk\n",
    "from prettytable import PrettyTable\n",
    "from smart_open import smart_open\n",
    "\n",
    "logging.basicConfig(level=logging.INFO)\n",
    "nltk.download('wordnet')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Please set the variable `parent_directory` below to change the directory to which the repositories are cloned."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/home/jayant/Projects/gensim/gensim/docs/notebooks\n"
     ]
    }
   ],
   "source": [
    "% cd docs/notebooks/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "current_directory = os.getcwd()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set this variable to `True` to remove and re-download the repos for the external implementations\n",
    "force_setup = False\n",
    "\n",
    "# The poincare datasets, models and source code for external models are downloaded to this directory\n",
    "parent_directory = os.path.join(current_directory, 'poincare')\n",
    "! mkdir -p {parent_directory}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/home/jayant/Projects/gensim/gensim/docs/notebooks/poincare\n"
     ]
    }
   ],
   "source": [
    "% cd {parent_directory}\n",
    "\n",
    "# Clone repos\n",
    "np_repo_name = 'poincare-np-embedding'\n",
    "if force_setup and os.path.exists(np_repo_name):\n",
    "    ! rm -rf {np_repo_name}\n",
    "clone_np_repo = not os.path.exists(np_repo_name)\n",
    "if clone_np_repo:\n",
    "    ! git clone https://github.com/nishnik/poincare_embeddings.git {np_repo_name}\n",
    "\n",
    "cpp_repo_name = 'poincare-cpp-embedding'\n",
    "if force_setup and os.path.exists(cpp_repo_name):\n",
    "    ! rm -rf {cpp_repo_name}\n",
    "clone_cpp_repo = not os.path.exists(cpp_repo_name)\n",
    "if clone_cpp_repo:\n",
    "    ! git clone https://github.com/TatsuyaShirakawa/poincare-embedding.git {cpp_repo_name}\n",
    "\n",
    "patches_applied = False"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Apply patches\n",
    "# Absolute paths are used so that each patch is applied in the right repo even if only one repo was cloned\n",
    "if clone_cpp_repo and not patches_applied:\n",
    "    % cd {parent_directory}/{cpp_repo_name}\n",
    "    ! git apply ../poincare_burn_in_eps.patch\n",
    "\n",
    "if clone_np_repo and not patches_applied:\n",
    "    % cd {parent_directory}/{np_repo_name}\n",
    "    ! git apply ../poincare_numpy.patch\n",
    "    \n",
    "patches_applied = True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/home/jayant/projects/gensim/docs/notebooks/poincare/poincare-cpp-embedding\n",
      "/home/jayant/projects/gensim/docs/notebooks/poincare/poincare-cpp-embedding/work\n",
      "-- Configuring done\n",
      "-- Generating done\n",
      "-- Build files have been written to: /home/jayant/projects/gensim/docs/notebooks/poincare/poincare-cpp-embedding/work\n",
      "[100%] Built target poincare_embedding\n",
      "/home/jayant/projects/gensim/docs/notebooks\n"
     ]
    }
   ],
   "source": [
    "# Compile the code for the external c++ implementation into a binary\n",
    "% cd {parent_directory}/{cpp_repo_name}\n",
    "! mkdir -p work\n",
    "% cd work\n",
    "! cmake ..\n",
    "! make\n",
    "% cd {current_directory}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You might need to install an updated version of `cmake` to be able to compile the source code. Please make sure that the binary `poincare_embedding` has been created before proceeding by verifying the above cell does not raise an error."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "cpp_binary_path = os.path.join(parent_directory, cpp_repo_name, 'work', 'poincare_embedding')\n",
    "assert os.path.exists(cpp_binary_path), 'Binary file does not exist at %s' % cpp_binary_path"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Training\n",
    "\n",
    "### 2.1 Create the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "# These directories are created automatically inside `parent_directory` for storing poincare datasets and models\n",
    "data_directory = os.path.join(parent_directory, 'data')\n",
    "models_directory = os.path.join(parent_directory, 'models')\n",
    "\n",
    "# Create directories\n",
    "! mkdir -p {data_directory}\n",
    "! mkdir -p {models_directory}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Prepare the WordNet data\n",
    "# Can also be downloaded directly from -\n",
    "# https://github.com/jayantj/gensim/raw/wordnet_data/docs/notebooks/poincare/data/wordnet_noun_hypernyms.tsv\n",
    "\n",
    "wordnet_file = os.path.join(data_directory, 'wordnet_noun_hypernyms.tsv')\n",
    "if not os.path.exists(wordnet_file):\n",
    "    ! python {parent_directory}/{cpp_repo_name}/scripts/create_wordnet_noun_hierarchy.py {wordnet_file}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2017-11-14 11:15:54--  http://people.ds.cam.ac.uk/iv250/paper/hyperlex/hyperlex-data.zip\n",
      "Resolving people.ds.cam.ac.uk (people.ds.cam.ac.uk)... 131.111.3.47\n",
      "Connecting to people.ds.cam.ac.uk (people.ds.cam.ac.uk)|131.111.3.47|:80... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 183900 (180K) [application/zip]\n",
      "Saving to: ‘/home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex-data.zip’\n",
      "\n",
      "/home/jayant/projec 100%[===================>] 179.59K  --.-KB/s    in 0.06s   \n",
      "\n",
      "2017-11-14 11:15:54 (2.94 MB/s) - ‘/home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex-data.zip’ saved [183900/183900]\n",
      "\n",
      "Archive:  /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex-data.zip\n",
      "   creating: /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex/nouns-verbs/\n",
      "  inflating: /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex/nouns-verbs/hyperlex-verbs.txt  \n",
      "  inflating: /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex/nouns-verbs/hyperlex-nouns.txt  \n",
      "   creating: /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex/splits/\n",
      "   creating: /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex/splits/random/\n",
      "  inflating: /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex/splits/random/hyperlex_training_all_random.txt  \n",
      "  inflating: /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex/splits/random/hyperlex_test_all_random.txt  \n",
      "  inflating: /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex/splits/random/hyperlex_dev_all_random.txt  \n",
      "   creating: /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex/splits/lexical/\n",
      "  inflating: /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex/splits/lexical/hyperlex_dev_all_lexical.txt  \n",
      "  inflating: /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex/splits/lexical/hyperlex_test_all_lexical.txt  \n",
      "  inflating: /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex/splits/lexical/hyperlex_training_all_lexical.txt  \n",
      "  inflating: /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex/hyperlex-all.txt  \n",
      "  inflating: /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex/README.txt  \n"
     ]
    }
   ],
   "source": [
    "# Prepare the HyperLex data\n",
    "hyperlex_url = \"http://people.ds.cam.ac.uk/iv250/paper/hyperlex/hyperlex-data.zip\"\n",
    "! wget {hyperlex_url} -O {data_directory}/hyperlex-data.zip\n",
    "if os.path.exists(os.path.join(data_directory, 'hyperlex')):\n",
    "    ! rm -r {data_directory}/hyperlex\n",
    "! unzip {data_directory}/hyperlex-data.zip -d {data_directory}/hyperlex/\n",
    "hyperlex_file = os.path.join(data_directory, 'hyperlex', 'nouns-verbs', 'hyperlex-nouns.txt')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 Training [C++ embeddings](https://github.com/TatsuyaShirakawa/poincare-embedding)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "def train_cpp_model(\n",
    "    binary_path, data_file, output_file, dim, epochs, neg,\n",
    "    num_threads, epsilon, burn_in, seed=0):\n",
    "    \"\"\"Train a poincare embedding using the c++ implementation\n",
    "    \n",
    "    Args:\n",
    "        binary_path (str): Path to the compiled c++ implementation binary\n",
    "        data_file (str): Path to tsv file containing relation pairs\n",
    "        output_file (str): Path to output file containing model\n",
    "        dim (int): Number of dimensions of the trained model\n",
    "        epochs (int): Number of epochs to use\n",
    "        neg (int): Number of negative samples to use\n",
    "        num_threads (int): Number of threads to use for training the model\n",
    "        epsilon (float): Constant used for clipping below a norm of one\n",
    "        burn_in (int): Number of epochs to use for burn-in init (0 means no burn-in)\n",
    "    \n",
    "    Notes: \n",
    "        If `output_file` already exists, skips training\n",
    "    \"\"\"\n",
    "    if os.path.exists(output_file):\n",
    "        print('File %s exists, skipping' % output_file)\n",
    "        return\n",
    "    args = {\n",
    "        'dim': dim,\n",
    "        'max_epoch': epochs,\n",
    "        'neg_size': neg,\n",
    "        'num_thread': num_threads,\n",
    "        'epsilon': epsilon,\n",
    "        'burn_in': burn_in,\n",
    "        'learning_rate_init': 0.1,\n",
    "        'learning_rate_final': 0.0001,\n",
    "    }\n",
    "    cmd = [binary_path, data_file, output_file]\n",
    "    for option, value in args.items():\n",
    "        cmd.append(\"--%s\" % option)\n",
    "        cmd.append(str(value))\n",
    "    \n",
    "    return check_output(args=cmd)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_sizes = [5, 10, 20, 50, 100, 200]\n",
    "default_params = {\n",
    "    'neg': 20,\n",
    "    'epochs': 50,\n",
    "    'threads': 8,\n",
    "    'eps': 1e-6,\n",
    "    'burn_in': 0,\n",
    "    'batch_size': 10,\n",
    "    'reg': 0.0\n",
    "}\n",
    "\n",
    "non_default_params = {\n",
    "    'neg': [10],\n",
    "    'epochs': [200],\n",
    "    'burn_in': [10]\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "def cpp_model_name_from_params(params, prefix):\n",
    "    param_keys = ['burn_in', 'epochs', 'neg', 'eps', 'threads']\n",
    "    name = ['%s_%s' % (key, params[key]) for key in sorted(param_keys)]\n",
    "    return '%s_%s' % (prefix, '_'.join(name))\n",
    "\n",
    "def train_model_with_params(params, train_file, model_sizes, prefix, implementation):\n",
    "    \"\"\"Trains models with given params for multiple model sizes using the given implementation\n",
    "    \n",
    "    Args:\n",
    "        params (dict): parameters to train the model with\n",
    "        train_file (str): Path to tsv file containing relation pairs\n",
    "        model_sizes (list): list of dimension sizes (integer) to train the model with\n",
    "        prefix (str): prefix to use for the saved model filenames\n",
    "        implementation (str): which implementation to use,\n",
    "                              allowed values: 'numpy', 'c++', 'gensim'\n",
    "    \n",
    "    Returns:\n",
    "        tuple (model_name, model_files)\n",
    "        model_files is a dict of (size, filename) pairs\n",
    "        Example: ('cpp_model_epochs_50', {5: 'models/cpp_model_epochs_50_dim_5'})\n",
    "    \"\"\"\n",
    "    files = {}\n",
    "    if implementation == 'c++':\n",
    "        model_name = cpp_model_name_from_params(params, prefix)\n",
    "    elif implementation == 'numpy':\n",
    "        model_name = np_model_name_from_params(params, prefix)\n",
    "    elif implementation == 'gensim':\n",
    "        model_name = gensim_model_name_from_params(params, prefix)\n",
    "    else:\n",
    "        raise ValueError('Given implementation %s not found' % implementation)\n",
    "    for model_size in model_sizes:\n",
    "        output_file_name = '%s_dim_%d' % (model_name, model_size)\n",
    "        output_file = os.path.join(models_directory, output_file_name)\n",
    "        print('Training model %s of size %d' % (model_name, model_size))\n",
    "        if implementation == 'c++':\n",
    "            out = train_cpp_model(\n",
    "                cpp_binary_path, train_file, output_file, model_size,\n",
    "                params['epochs'], params['neg'], params['threads'],\n",
    "                params['eps'], params['burn_in'], seed=0)\n",
    "        elif implementation == 'numpy':\n",
    "            train_external_numpy_model(\n",
    "                python_script_path, train_file, output_file, model_size,\n",
    "                params['epochs'], params['neg'], seed=0)\n",
    "        elif implementation == 'gensim':\n",
    "            train_gensim_model(\n",
    "                train_file, output_file, model_size, params['epochs'],\n",
    "                params['neg'], params['burn_in'], params['batch_size'], params['reg'], seed=0)\n",
    "        else:\n",
    "            raise ValueError('Given implementation %s not found' % implementation)\n",
    "        files[model_size] = output_file\n",
    "    return (model_name, files)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_files = {}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "model_files['c++'] = {}\n",
    "# Train c++ models with default params\n",
    "model_name, files = train_model_with_params(default_params, wordnet_file, model_sizes, 'cpp_model', 'c++')\n",
    "model_files['c++'][model_name] = {}\n",
    "for dim, filepath in files.items():\n",
    "    model_files['c++'][model_name][dim] = filepath\n",
    "# Train c++ models with non-default params\n",
    "for param, values in non_default_params.items():\n",
    "    params = default_params.copy()\n",
    "    for value in values:\n",
    "        params[param] = value\n",
    "        model_name, files = train_model_with_params(params, wordnet_file, model_sizes, 'cpp_model', 'c++')\n",
    "        model_files['c++'][model_name] = {}\n",
    "        for dim, filepath in files.items():\n",
    "            model_files['c++'][model_name][dim] = filepath"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 Training [numpy embeddings](https://github.com/nishnik/poincare_embeddings) (non-gensim)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "python_script_path = os.path.join(parent_directory, np_repo_name, 'poincare.py')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "def np_model_name_from_params(params, prefix):\n",
    "    param_keys = ['neg', 'epochs']\n",
    "    name = ['%s_%s' % (key, params[key]) for key in sorted(param_keys)]\n",
    "    return '%s_%s' % (prefix, '_'.join(name))\n",
    "\n",
    "def train_external_numpy_model(\n",
    "    script_path, data_file, output_file, dim, epochs, neg, seed=0):\n",
    "    \"\"\"Train a poincare embedding using an external numpy implementation\n",
    "    \n",
    "    Args:\n",
    "        script_path (str): Path to the Python training script\n",
    "        data_file (str): Path to tsv file containing relation pairs\n",
    "        output_file (str): Path to output file containing model\n",
    "        dim (int): Number of dimensions of the trained model\n",
    "        epochs (int): Number of epochs to use\n",
    "        neg (int): Number of negative samples to use\n",
    "    \n",
    "    Notes: \n",
    "        If `output_file` already exists, skips training\n",
    "    \"\"\"\n",
    "    if os.path.exists(output_file):\n",
    "        print('File %s exists, skipping' % output_file)\n",
    "        return\n",
    "    args = {\n",
    "        'input-file': data_file,\n",
    "        'output-file': output_file,\n",
    "        'dimensions': dim,\n",
    "        'epochs': epochs,\n",
    "        'learning-rate': 0.01,\n",
    "        'num-negative': neg,\n",
    "    }\n",
    "    cmd = ['python', script_path]\n",
    "    for option, value in args.items():\n",
    "        cmd.append(\"--%s\" % option)\n",
    "        cmd.append(str(value))\n",
    "    \n",
    "    return check_output(args=cmd)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_files['numpy'] = {}\n",
    "# Train models with default params\n",
    "model_name, files = train_model_with_params(default_params, wordnet_file, model_sizes, 'np_model', 'numpy')\n",
    "model_files['numpy'][model_name] = {}\n",
    "for dim, filepath in files.items():\n",
    "    model_files['numpy'][model_name][dim] = filepath"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.4 Training gensim embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "def gensim_model_name_from_params(params, prefix):\n",
    "    param_keys = ['neg', 'epochs', 'burn_in', 'batch_size', 'reg']\n",
    "    name = ['%s_%s' % (key, params[key]) for key in sorted(param_keys)]\n",
    "    return '%s_%s' % (prefix, '_'.join(name))\n",
    "\n",
    "def train_gensim_model(\n",
    "    data_file, output_file, dim, epochs, neg, burn_in, batch_size, reg, seed=0):\n",
    "    \"\"\"Train a poincare embedding using gensim implementation\n",
    "    \n",
    "    Args:\n",
    "        data_file (str): Path to tsv file containing relation pairs\n",
    "        output_file (str): Path to output file containing model\n",
    "        dim (int): Number of dimensions of the trained model\n",
    "        epochs (int): Number of epochs to use\n",
    "        neg (int): Number of negative samples to use\n",
    "        burn_in (int): Number of epochs to use for burn-in initialization\n",
    "        batch_size (int): Size of batch to use for training\n",
    "        reg (float): Coefficient used for l2-regularization while training\n",
    "    \n",
    "    Notes: \n",
    "        If `output_file` already exists, skips training\n",
    "    \"\"\"\n",
    "    if os.path.exists(output_file):\n",
    "        print('File %s exists, skipping' % output_file)\n",
    "        return\n",
    "    train_data = PoincareRelations(data_file)\n",
    "    model = PoincareModel(train_data, size=dim, negative=neg, burn_in=burn_in, regularization_coeff=reg)\n",
    "    model.train(epochs=epochs, batch_size=batch_size)\n",
    "    model.save(output_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "non_default_params_gensim = [\n",
    "    {'neg': 10,},\n",
    "    {'burn_in': 10,},\n",
    "    {'batch_size': 50,},\n",
    "    {'neg': 10, 'reg': 1, 'burn_in': 10, 'epochs': 200},\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_files['gensim'] = {}\n",
    "# Train models with default params\n",
    "model_name, files = train_model_with_params(default_params, wordnet_file, model_sizes, 'gensim_model', 'gensim')\n",
    "model_files['gensim'][model_name] = {}\n",
    "for dim, filepath in files.items():\n",
    "    model_files['gensim'][model_name][dim] = filepath\n",
    "# Train models with non-default params\n",
    "for new_params in non_default_params_gensim:\n",
    "    params = default_params.copy()\n",
    "    params.update(new_params)\n",
    "    model_name, files = train_model_with_params(params, wordnet_file, model_sizes, 'gensim_model', 'gensim')\n",
    "    model_files['gensim'][model_name] = {}\n",
    "    for dim, filepath in files.items():\n",
    "        model_files['gensim'][model_name][dim] = filepath"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Loading the embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "def transform_cpp_embedding_to_kv(input_file, output_file, encoding='utf8'):\n",
    "    \"\"\"Given a C++ embedding tsv filepath, converts it to a KeyedVector-supported file\"\"\"\n",
    "    with smart_open(input_file, 'rb') as f:\n",
    "        lines = [line.decode(encoding) for line in f]\n",
    "    if not lines:\n",
    "        raise ValueError(\"file is empty\")\n",
    "    first_line = lines[0]\n",
    "    parts = first_line.rstrip().split(\"\\t\")\n",
    "    model_size = len(parts) - 1\n",
    "    vocab_size = len(lines)\n",
    "    with smart_open(output_file, 'w') as f:\n",
    "        f.write('%d %d\\n' % (vocab_size, model_size))\n",
    "        for line in lines:\n",
    "            f.write(line.replace('\\t', ' '))\n",
    "\n",
    "def transform_numpy_embedding_to_kv(input_file, output_file, encoding='utf8'):\n",
    "    \"\"\"Given a numpy poincare embedding pkl filepath, converts it to a KeyedVector-supported file\"\"\"\n",
    "    # Use a context manager so that the file handle is closed after loading\n",
    "    with open(input_file, 'rb') as f:\n",
    "        np_embeddings = pickle.load(f)\n",
    "    random_embedding = np_embeddings[list(np_embeddings.keys())[0]]\n",
    "    \n",
    "    model_size = random_embedding.shape[0]\n",
    "    vocab_size = len(np_embeddings)\n",
    "    with smart_open(output_file, 'w') as f:\n",
    "        f.write('%d %d\\n' % (vocab_size, model_size))\n",
    "        for key, vector in np_embeddings.items():\n",
    "            vector_string = ' '.join('%.6f' % value for value in vector)\n",
    "            f.write('%s %s\\n' % (key, vector_string))\n",
    "\n",
    "def load_poincare_cpp(input_filename):\n",
    "    \"\"\"Load embedding trained via C++ Poincare model.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    input_filename : str\n",
    "        Path to tsv file containing embedding.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    PoincareKeyedVectors instance.\n",
    "\n",
    "    \"\"\"\n",
    "    keyed_vectors_filename = input_filename + '.kv'\n",
    "    transform_cpp_embedding_to_kv(input_filename, keyed_vectors_filename)\n",
    "    embedding = PoincareKeyedVectors.load_word2vec_format(keyed_vectors_filename)\n",
    "    os.unlink(keyed_vectors_filename)\n",
    "    return embedding\n",
    "\n",
    "def load_poincare_numpy(input_filename):\n",
    "    \"\"\"Load embedding trained via Python numpy Poincare model.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    input_filename : str\n",
    "        Path to pkl file containing embedding.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    PoincareKeyedVectors instance.\n",
    "\n",
    "    \"\"\"\n",
    "    keyed_vectors_filename = input_filename + '.kv'\n",
    "    transform_numpy_embedding_to_kv(input_filename, keyed_vectors_filename)\n",
    "    embedding = PoincareKeyedVectors.load_word2vec_format(keyed_vectors_filename)\n",
    "    os.unlink(keyed_vectors_filename)\n",
    "    return embedding\n",
    "\n",
    "def load_poincare_gensim(input_filename):\n",
    "    \"\"\"Load embedding trained via Gensim PoincareModel.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    input_filename : str\n",
    "        Path to model file.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    PoincareKeyedVectors instance.\n",
    "\n",
    "    \"\"\"\n",
    "    model = PoincareModel.load(input_filename)\n",
    "    return model.kv\n",
    "\n",
    "def load_model(implementation, model_file):\n",
    "    \"\"\"Convenience function over functions to load models from different implementations.\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    implementation : str\n",
    "        Implementation used to create model file ('c++'/'numpy'/'gensim').\n",
    "    model_file : str\n",
    "        Path to model file.\n",
    "    \n",
    "    Returns\n",
    "    -------\n",
    "    PoincareKeyedVectors instance\n",
    "    \n",
    "    Notes\n",
    "    -----\n",
    "    Raises ValueError in case of invalid value for `implementation`\n",
    "\n",
    "    \"\"\"\n",
    "    if implementation == 'c++':\n",
    "        return load_poincare_cpp(model_file)\n",
    "    elif implementation == 'numpy':\n",
    "        return load_poincare_numpy(model_file)\n",
    "    elif implementation == 'gensim':\n",
    "        return load_poincare_gensim(model_file)\n",
    "    else:\n",
    "        raise ValueError('Invalid implementation %s' % implementation)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 176,
   "metadata": {},
   "outputs": [],
   "source": [
    "def display_results(task_name, results):\n",
    "    \"\"\"Display evaluation results of multiple embeddings on a single task in a tabular format\n",
    "    \n",
    "    Args:\n",
    "        task_name (str): name of the task being evaluated\n",
    "        results (dict): mapping between embeddings and corresponding results\n",
    "    \n",
    "    \"\"\"\n",
    "    result_table = PrettyTable()\n",
    "    result_table.field_names = [\"Model Description\", \"Metric\"] + [str(dim) for dim in sorted(model_sizes)]\n",
    "    for model_name, model_results in results.items():\n",
    "        metrics = [metric for metric in model_results.keys()]\n",
    "        dims = sorted([dim for dim in model_results[metrics[0]].keys()])\n",
    "        description = model_description_from_name(model_name)\n",
    "        row = [description, '\\n'.join(metrics) + '\\n']\n",
    "        for dim in dims:\n",
    "            scores = ['%.2f' % model_results[metric][dim] for metric in metrics]\n",
    "            row.append('\\n'.join(scores))\n",
    "        result_table.add_row(row)\n",
    "    result_table.align = 'r'\n",
    "    result_html = result_table.get_html_string()\n",
    "    search = \"<table>\"\n",
    "    insert_at = result_html.index(search) + len(search)\n",
    "    new_row = \"\"\"\n",
    "        <tr>\n",
    "            <th colspan=\"1\" style=\"text-align:left\">%s</th>\n",
    "            <th colspan=\"1\"></th>\n",
    "            <th colspan=\"%d\" style=\"text-align:center\"> Dimensions</th>\n",
    "        </tr>\"\"\" % (task_name, len(model_sizes))\n",
    "    result_html = result_html[:insert_at] + new_row + result_html[insert_at:]\n",
    "    display(HTML(result_html))\n",
    "    \n",
    "def model_description_from_name(model_name):\n",
    "    if model_name.startswith('gensim'):\n",
    "        implementation = 'Gensim'\n",
    "    elif model_name.startswith('cpp'):\n",
    "        implementation = 'C++'\n",
    "    elif model_name.startswith('np'):\n",
    "        implementation = 'Numpy'\n",
    "    else:\n",
    "        raise ValueError('Unsupported implementation for model: %s' % model_name)\n",
    "    description = []\n",
    "    for param_key in sorted(default_params.keys()):\n",
    "        pattern = '%s_([^_]*)_?' % param_key\n",
    "        match = re.search(pattern, model_name)\n",
    "        if match:\n",
    "            description.append(\"%s=%s\" % (param_key, match.groups()[0]))\n",
    "    return \"%s: %s\" % (implementation, \", \".join(description))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.1 WordNet reconstruction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this task, embeddings are learnt using the entire transitive closure of the WordNet noun hypernym hierarchy. Subsequently, for every hypernym pair `(u, v)`, the rank of `v` amongst all nodes that do not have a positive edge with `v` is computed. The final metric `mean_rank` is the average of all these ranks. The `MAP` metric is the mean of the Average Precision of the rankings for all positive nodes for a given node `u`.\n",
    "\n",
    "Note that this task tests representation capacity of the learnt embeddings, and not the generalization ability."
   ]
  },
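  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sketch of the two metrics in our own notation (an assumption, not taken verbatim from the paper or from the evaluation code): let $D$ be the set of positive relations, $d$ the Poincaré distance between two embeddings, and $N(u)$ the set of nodes that have no positive edge with $u$. Ignoring ties, the rank of a positive pair and the mean rank can be written as\n",
    "\n",
    "$$\\mathrm{rank}(u, v) = 1 + \\left\\lvert \\lbrace v' \\in N(u) : d(u, v') \\lt d(u, v) \\rbrace \\right\\rvert, \\qquad \\text{mean rank} = \\frac{1}{\\lvert D \\rvert} \\sum_{(u, v) \\in D} \\mathrm{rank}(u, v)$$\n",
    "\n",
    "For `MAP`, let $P(u)$ be the positive nodes of $u$ and $r_k(u)$ the position of the $k$-th closest positive node when all candidate nodes are sorted by distance from $u$. With $U$ the set of nodes that have at least one positive edge, the usual definition of average precision gives\n",
    "\n",
    "$$\\mathrm{AP}(u) = \\frac{1}{\\lvert P(u) \\rvert} \\sum_{k=1}^{\\lvert P(u) \\rvert} \\frac{k}{r_k(u)}, \\qquad \\mathrm{MAP} = \\frac{1}{\\lvert U \\rvert} \\sum_{u \\in U} \\mathrm{AP}(u)$$\n",
    "\n",
    "Lower mean rank and higher `MAP` are better."
   ]
  },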
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
    "reconstruction_results = OrderedDict()\n",
    "metrics = ['mean_rank', 'MAP']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "for implementation, models in sorted(model_files.items()):\n",
    "    for model_name, files in models.items():\n",
    "        if model_name in reconstruction_results:\n",
    "            continue\n",
    "        reconstruction_results[model_name] = OrderedDict()\n",
    "        for metric in metrics:\n",
    "            reconstruction_results[model_name][metric] = {}\n",
    "        for model_size, model_file in files.items():\n",
    "            print('Evaluating model %s of size %d' % (model_name, model_size))\n",
    "            embedding = load_model(implementation, model_file)\n",
    "            eval_instance = ReconstructionEvaluation(wordnet_file, embedding)\n",
    "            eval_result = eval_instance.evaluate(max_n=1000)\n",
    "            for metric in metrics:\n",
    "                reconstruction_results[model_name][metric][model_size] = eval_result[metric]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 148,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "    <tr>\n",
       "        <th colspan=\"1\" style=\"text-align:left\">WordNet Reconstruction</th><th \n",
       "        <th colspan=\"1\"></th>\n",
       "        <th colspan=\"6\" style=\"text-align:center\"> Dimensions</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <th>Model Description</th>\n",
       "        <th>Metric</th>\n",
       "        <th>5</th>\n",
       "        <th>10</th>\n",
       "        <th>20</th>\n",
       "        <th>50</th>\n",
       "        <th>100</th>\n",
       "        <th>200</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>C++: burn_in=0, epochs=200, eps=1e-06, neg=20, threads=8</td>\n",
       "        <td>mean_rank<br>MAP<br></td>\n",
       "        <td>191.69<br>0.34</td>\n",
       "        <td>97.65<br>0.43</td>\n",
       "        <td>72.07<br>0.51</td>\n",
       "        <td>55.48<br>0.57</td>\n",
       "        <td>46.76<br>0.59</td>\n",
       "        <td>49.62<br>0.59</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>C++: burn_in=0, epochs=50, eps=1e-06, neg=10, threads=8</td>\n",
       "        <td>mean_rank<br>MAP<br></td>\n",
       "        <td>280.17<br>0.27</td>\n",
       "        <td>129.46<br>0.40</td>\n",
       "        <td>92.06<br>0.49</td>\n",
       "        <td>80.41<br>0.53</td>\n",
       "        <td>71.42<br>0.56</td>\n",
       "        <td>69.30<br>0.56</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>C++: burn_in=0, epochs=50, eps=1e-06, neg=20, threads=8</td>\n",
       "        <td>mean_rank<br>MAP<br></td>\n",
       "        <td>265.72<br>0.28</td>\n",
       "        <td>116.94<br>0.41</td>\n",
       "        <td>90.81<br>0.49</td>\n",
       "        <td>59.47<br>0.56</td>\n",
       "        <td>55.14<br>0.58</td>\n",
       "        <td>54.31<br>0.59</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>C++: burn_in=10, epochs=50, eps=1e-06, neg=20, threads=8</td>\n",
       "        <td>mean_rank<br>MAP<br></td>\n",
       "        <td>252.86<br>0.26</td>\n",
       "        <td>195.73<br>0.32</td>\n",
       "        <td>182.57<br>0.34</td>\n",
       "        <td>165.33<br>0.36</td>\n",
       "        <td>157.37<br>0.36</td>\n",
       "        <td>155.78<br>0.36</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>Gensim: batch_size=10, burn_in=10, epochs=50, neg=20, reg=0.0</td>\n",
       "        <td>mean_rank<br>MAP<br></td>\n",
       "        <td>108.01<br>0.37</td>\n",
       "        <td>100.73<br>0.47</td>\n",
       "        <td>97.38<br>0.48</td>\n",
       "        <td>94.49<br>0.49</td>\n",
       "        <td>94.68<br>0.48</td>\n",
       "        <td>89.66<br>0.49</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>Gensim: batch_size=10, burn_in=0, epochs=50, neg=20, reg=0.0</td>\n",
       "        <td>mean_rank<br>MAP<br></td>\n",
       "        <td>154.41<br>0.40</td>\n",
       "        <td>62.77<br>0.63</td>\n",
       "        <td>27.32<br>0.72</td>\n",
       "        <td>20.22<br>0.77</td>\n",
       "        <td>16.15<br>0.78</td>\n",
       "        <td>13.20<br>0.79</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>Gensim: batch_size=10, burn_in=0, epochs=50, neg=10, reg=0.0</td>\n",
       "        <td>mean_rank<br>MAP<br></td>\n",
       "        <td>211.71<br>0.33</td>\n",
       "        <td>54.42<br>0.60</td>\n",
       "        <td>24.90<br>0.72</td>\n",
       "        <td>21.42<br>0.76</td>\n",
       "        <td>15.80<br>0.78</td>\n",
       "        <td>15.13<br>0.79</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>Gensim: batch_size=50, burn_in=0, epochs=50, neg=20, reg=0.0</td>\n",
       "        <td>mean_rank<br>MAP<br></td>\n",
       "        <td>148.51<br>0.38</td>\n",
       "        <td>63.67<br>0.62</td>\n",
       "        <td>28.36<br>0.72</td>\n",
       "        <td>20.23<br>0.76</td>\n",
       "        <td>15.75<br>0.78</td>\n",
       "        <td>13.59<br>0.79</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>Gensim: batch_size=10, burn_in=10, epochs=200, neg=10, reg=1</td>\n",
       "        <td>mean_rank<br>MAP<br></td>\n",
       "        <td>61.48<br>0.38</td>\n",
       "        <td>54.70<br>0.41</td>\n",
       "        <td>53.02<br>0.41</td>\n",
       "        <td>50.80<br>0.42</td>\n",
       "        <td>49.58<br>0.42</td>\n",
       "        <td>48.56<br>0.43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>Numpy: epochs=50, neg=20</td>\n",
       "        <td>mean_rank<br>MAP<br></td>\n",
       "        <td>9617.57<br>0.14</td>\n",
       "        <td>5902.65<br>0.16</td>\n",
       "        <td>3868.78<br>0.19</td>\n",
       "        <td>1117.77<br>0.25</td>\n",
       "        <td>529.92<br>0.30</td>\n",
       "        <td>377.45<br>0.35</td>\n",
       "    </tr>\n",
       "</table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display_results('WordNet Reconstruction', reconstruction_results)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Results from the paper -\n",
    "![Reconstruction Results](https://raw.githubusercontent.com/RaRe-Technologies/gensim/poincare_model_keyedvectors/docs/notebooks/poincare/reconstruction_paper.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The figures above illustrate a few things - \n",
    "1. The gensim implementation does significantly better for all model sizes and hyperparameters than both the other implementations.\n",
    "2. The results from the original paper have not been achieved by our implementation. Especially for models with lower dimensions, the paper mentions significantly better mean rank and MAP for the reconstruction task.\n",
    "3. Using burn-in and regularization leads to much better results with low model sizes, however the results do not improve significantly with increasing model size. This might have to do with tuning the regularization coefficient, which the paper does not mention."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.2 WordNet link prediction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This task is similar to the reconstruction task described above, except that the list of relations is split into a training and testing set, and the mean rank reported is for the edges in the test set.\n",
    "\n",
    "Therefore, this tests the ability of the model to predict unseen edges between nodes, i.e. generalization ability, as opposed to the representation capacity tested in the Reconstruction task"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4.2.1 Preparing data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [],
   "source": [
    "def train_test_split(data_file, test_ratio=0.1):\n",
    "    \"\"\"Creates train and test files from given data file, returns train/test file names\n",
    "    \n",
    "    Args:\n",
    "        data_file (str): path to data file for which train/test split is to be created\n",
    "        test_ratio (float): fraction of lines to be used for test data\n",
    "    \n",
    "    Returns\n",
    "        (train_file, test_file): tuple of strings with train file and test file paths\n",
    "    \"\"\"\n",
    "    train_filename = data_file + '.train'\n",
    "    test_filename = data_file + '.test'\n",
    "    if os.path.exists(train_filename) and os.path.exists(test_filename):\n",
    "        print('Train and test files already exist, skipping')\n",
    "        return (train_filename, test_filename)\n",
    "    root_nodes, leaf_nodes = get_root_and_leaf_nodes(data_file)\n",
    "    test_line_candidates = []\n",
    "    line_count = 0\n",
    "    all_nodes = set()\n",
    "    with smart_open(data_file, 'rb') as f:\n",
    "        for i, line in enumerate(f):\n",
    "            node_1, node_2 = line.split()\n",
    "            all_nodes.update([node_1, node_2])\n",
    "            if (\n",
    "                    node_1 not in leaf_nodes\n",
    "                    and node_2 not in leaf_nodes\n",
    "                    and node_1 not in root_nodes\n",
    "                    and node_2 not in root_nodes\n",
    "                    and node_1 != node_2\n",
    "                ):\n",
    "                test_line_candidates.append(i)\n",
    "            line_count += 1\n",
    "\n",
    "    num_test_lines = int(test_ratio * line_count)\n",
    "    if num_test_lines > len(test_line_candidates):\n",
    "        raise ValueError('Not enough candidate relations for test set')\n",
    "    print('Choosing %d test lines from %d candidates' % (num_test_lines, len(test_line_candidates)))\n",
    "    test_line_indices = set(random.sample(test_line_candidates, num_test_lines))\n",
    "    train_line_indices = set(l for l in range(line_count) if l not in test_line_indices)\n",
    "    \n",
    "    train_set_nodes = set()\n",
    "    with smart_open(data_file, 'rb') as f:\n",
    "        train_file = smart_open(train_filename, 'wb')\n",
    "        test_file = smart_open(test_filename, 'wb')\n",
    "        for i, line in enumerate(f):\n",
    "            if i in train_line_indices:\n",
    "                train_set_nodes.update(line.split())\n",
    "                train_file.write(line)\n",
    "            elif i in test_line_indices:\n",
    "                test_file.write(line)\n",
    "            else:\n",
    "                raise AssertionError('Line %d not present in either train or test line indices' % i)\n",
    "        train_file.close()\n",
    "        test_file.close()\n",
    "    assert len(train_set_nodes) == len(all_nodes), 'Not all nodes from dataset present in train set relations'\n",
    "    return (train_filename, test_filename)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_root_and_leaf_nodes(data_file):\n",
    "    \"\"\"Return keys of root and leaf nodes from a file with transitive closure relations\n",
    "    \n",
    "    Args:\n",
    "        data_file(str): file path containing transitive closure relations\n",
    "    \n",
    "    Returns:\n",
    "        (root_nodes, leaf_nodes) - tuple containing keys of root and leaf nodes\n",
    "    \"\"\"\n",
    "    root_candidates = set()\n",
    "    leaf_candidates = set()\n",
    "    with smart_open(data_file, 'rb') as f:\n",
    "        for line in f:\n",
    "            nodes = line.split()\n",
    "            root_candidates.update(nodes)\n",
    "            leaf_candidates.update(nodes)\n",
    "    \n",
    "    with smart_open(data_file, 'rb') as f:\n",
    "        for line in f:\n",
    "            node_1, node_2 = line.split()\n",
    "            if node_1 == node_2:\n",
    "                continue\n",
    "            leaf_candidates.discard(node_1)\n",
    "            root_candidates.discard(node_2)\n",
    "    \n",
    "    return (leaf_candidates, root_candidates)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train and test files already exist, skipping\n"
     ]
    }
   ],
   "source": [
    "wordnet_train_file, wordnet_test_file = train_test_split(wordnet_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4.2.2 Training models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Training models for link prediction\n",
    "lp_model_files = {}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lp_model_files['c++'] = {}\n",
    "# Train c++ models with default params\n",
    "model_name, files = train_model_with_params(default_params, wordnet_train_file, model_sizes, 'cpp_lp_model', 'c++')\n",
    "lp_model_files['c++'][model_name] = {}\n",
    "for dim, filepath in files.items():\n",
    "    lp_model_files['c++'][model_name][dim] = filepath\n",
    "# Train c++ models with non-default params\n",
    "for param, values in non_default_params.items():\n",
    "    params = default_params.copy()\n",
    "    for value in values:\n",
    "        params[param] = value\n",
    "        model_name, files = train_model_with_params(params, wordnet_train_file, model_sizes, 'cpp_lp_model', 'c++')\n",
    "        lp_model_files['c++'][model_name] = {}\n",
    "        for dim, filepath in files.items():\n",
    "            lp_model_files['c++'][model_name][dim] = filepath"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lp_model_files['numpy'] = {}\n",
    "# Train numpy models with default params\n",
    "model_name, files = train_model_with_params(default_params, wordnet_train_file, model_sizes, 'np_lp_model', 'numpy')\n",
    "lp_model_files['numpy'][model_name] = {}\n",
    "for dim, filepath in files.items():\n",
    "    lp_model_files['numpy'][model_name][dim] = filepath"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lp_model_files['gensim'] = {}\n",
    "# Train models with default params\n",
    "model_name, files = train_model_with_params(default_params, wordnet_train_file, model_sizes, 'gensim_lp_model', 'gensim')\n",
    "lp_model_files['gensim'][model_name] = {}\n",
    "for dim, filepath in files.items():\n",
    "    lp_model_files['gensim'][model_name][dim] = filepath\n",
    "# Train models with non-default params\n",
    "for new_params in non_default_params_gensim:\n",
    "    params = default_params.copy()\n",
    "    params.update(new_params)\n",
    "    model_name, files = train_model_with_params(params, wordnet_file, model_sizes, 'gensim_lp_model', 'gensim')\n",
    "    lp_model_files['gensim'][model_name] = {}\n",
    "    for dim, filepath in files.items():\n",
    "        lp_model_files['gensim'][model_name][dim] = filepath"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4.2.3 Evaluating models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {},
   "outputs": [],
   "source": [
    "lp_results = OrderedDict()\n",
    "metrics = ['mean_rank', 'MAP']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for implementation, models in sorted(lp_model_files.items()):\n",
    "    for model_name, files in models.items():\n",
    "        lp_results[model_name] = OrderedDict()\n",
    "        for metric in metrics:\n",
    "            lp_results[model_name][metric] = {}\n",
    "        for model_size, model_file in files.items():\n",
    "            print('Evaluating model %s of size %d' % (model_name, model_size))\n",
    "            embedding = load_model(implementation, model_file)\n",
    "            eval_instance = LinkPredictionEvaluation(wordnet_train_file, wordnet_test_file, embedding)\n",
    "            eval_result = eval_instance.evaluate(max_n=1000)\n",
    "            for metric in metrics:\n",
    "                lp_results[model_name][metric][model_size] = eval_result[metric]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 149,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "    <tr>\n",
       "        <th colspan=\"1\" style=\"text-align:left\">WordNet Link Prediction</th>\n",
       "        <th colspan=\"1\"></th>\n",
       "        <th colspan=\"6\" style=\"text-align:center\"> Dimensions</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <th>Model Description</th>\n",
       "        <th>Metric</th>\n",
       "        <th>5</th>\n",
       "        <th>10</th>\n",
       "        <th>20</th>\n",
       "        <th>50</th>\n",
       "        <th>100</th>\n",
       "        <th>200</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>C++: burn_in=0, epochs=200, eps=1e-06, neg=20, threads=8</td>\n",
       "        <td>mean_rank<br>MAP<br></td>\n",
       "        <td>218.26<br>0.15</td>\n",
       "        <td>99.09<br>0.24</td>\n",
       "        <td>60.50<br>0.31</td>\n",
       "        <td>52.24<br>0.35</td>\n",
       "        <td>60.81<br>0.36</td>\n",
       "        <td>69.13<br>0.36</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>C++: burn_in=0, epochs=50, eps=1e-06, neg=20, threads=8</td>\n",
       "        <td>mean_rank<br>MAP<br></td>\n",
       "        <td>687.48<br>0.12</td>\n",
       "        <td>281.88<br>0.15</td>\n",
       "        <td>72.95<br>0.31</td>\n",
       "        <td>57.37<br>0.35</td>\n",
       "        <td>52.56<br>0.36</td>\n",
       "        <td>61.42<br>0.36</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>C++: burn_in=0, epochs=50, eps=1e-06, neg=10, threads=8</td>\n",
       "        <td>mean_rank<br>MAP<br></td>\n",
       "        <td>230.34<br>0.14</td>\n",
       "        <td>123.24<br>0.22</td>\n",
       "        <td>75.62<br>0.28</td>\n",
       "        <td>65.97<br>0.31</td>\n",
       "        <td>55.33<br>0.33</td>\n",
       "        <td>56.89<br>0.34</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>C++: burn_in=10, epochs=50, eps=1e-06, neg=20, threads=8</td>\n",
       "        <td>mean_rank<br>MAP<br></td>\n",
       "        <td>236.31<br>0.10</td>\n",
       "        <td>214.85<br>0.13</td>\n",
       "        <td>193.30<br>0.14</td>\n",
       "        <td>180.27<br>0.15</td>\n",
       "        <td>169.00<br>0.16</td>\n",
       "        <td>163.22<br>0.16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>Gensim: batch_size=10, burn_in=0, epochs=50, neg=10, reg=0.0</td>\n",
       "        <td>mean_rank<br>MAP<br></td>\n",
       "        <td>141.52<br>0.18</td>\n",
       "        <td>58.89<br>0.34</td>\n",
       "        <td>31.66<br>0.46</td>\n",
       "        <td>22.13<br>0.51</td>\n",
       "        <td>21.29<br>0.52</td>\n",
       "        <td>19.38<br>0.53</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>Gensim: batch_size=10, burn_in=0, epochs=50, neg=20, reg=0.0</td>\n",
       "        <td>mean_rank<br>MAP<br></td>\n",
       "        <td>121.42<br>0.19</td>\n",
       "        <td>52.51<br>0.37</td>\n",
       "        <td>24.61<br>0.46</td>\n",
       "        <td>19.96<br>0.52</td>\n",
       "        <td>20.44<br>0.50</td>\n",
       "        <td>19.55<br>0.54</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>Gensim: batch_size=50, burn_in=0, epochs=50, neg=20, reg=0.0</td>\n",
       "        <td>mean_rank<br>MAP<br></td>\n",
       "        <td>144.19<br>0.19</td>\n",
       "        <td>53.65<br>0.35</td>\n",
       "        <td>25.21<br>0.47</td>\n",
       "        <td>20.68<br>0.52</td>\n",
       "        <td>21.32<br>0.51</td>\n",
       "        <td>18.97<br>0.53</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>Gensim: batch_size=10, burn_in=10, epochs=50, neg=20, reg=0.0</td>\n",
       "        <td>mean_rank<br>MAP<br></td>\n",
       "        <td>154.95<br>0.16</td>\n",
       "        <td>138.12<br>0.21</td>\n",
       "        <td>122.06<br>0.24</td>\n",
       "        <td>117.96<br>0.26</td>\n",
       "        <td>112.99<br>0.25</td>\n",
       "        <td>110.84<br>0.26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>Gensim: batch_size=10, burn_in=10, epochs=200, neg=10, reg=1</td>\n",
       "        <td>mean_rank<br>MAP<br></td>\n",
       "        <td>51.72<br>0.22</td>\n",
       "        <td>39.85<br>0.28</td>\n",
       "        <td>38.60<br>0.29</td>\n",
       "        <td>36.55<br>0.30</td>\n",
       "        <td>35.32<br>0.31</td>\n",
       "        <td>34.66<br>0.31</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>Numpy: epochs=50, neg=20</td>\n",
       "        <td>mean_rank<br>MAP<br></td>\n",
       "        <td>14526.67<br>0.01</td>\n",
       "        <td>8411.10<br>0.02</td>\n",
       "        <td>5749.57<br>0.04</td>\n",
       "        <td>1873.12<br>0.07</td>\n",
       "        <td>1639.50<br>0.10</td>\n",
       "        <td>1350.13<br>0.13</td>\n",
       "    </tr>\n",
       "</table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display_results('WordNet Link Prediction', lp_results)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Results from the paper -\n",
    "![Link Prediction Paper](https://raw.githubusercontent.com/RaRe-Technologies/gensim/poincare_model_keyedvectors/docs/notebooks/poincare/link_prediction_paper.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These results follow similar trends as the reconstruction results. Repeating here for ease of reading - \n",
    "1. The gensim implementation does significantly better for all model sizes and hyperparameters than both the other implementations.\n",
    "2. The results from the original paper have not been achieved by our implementation. Especially for models with lower dimensions, the paper mentions significantly better mean rank and MAP for the link prediction task.\n",
    "4. Using burn-in and regularization leads to better results with low model sizes, however the results do not improve significantly with increasing model size.\n",
    "\n",
    "The main difference from the reconstruction results is that mean ranks for link prediction are slightly worse most of the time than the corresponding reconstruction results. This is to be expected, as link prediction is performed on a held-out test set."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.3 HyperLex Lexical Entailment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Lexical Entailment task is performed using the HyperLex dataset, a collection of 2163 noun pairs and scores that denote \"To what degree is noun A a type of noun Y\". For example - \n",
    "  \n",
    "`girl person 9.85`\n",
    "\n",
    "These scores are out of 10.\n",
    "\n",
    "The [spearman's correlation score](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient) is computed for the predicted and actual similarity scores, with the models trained on the entire WordNet noun hierarchy.\n"
   ]
  },
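  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, Spearman's correlation is the Pearson correlation computed on ranks. The textbook formula below is only an illustration and makes no claim about how `evaluate_spearman` is implemented: with $n$ evaluated noun pairs and $d_i$ the difference between the rank of pair $i$ under the predicted scores and its rank under the HyperLex scores, and assuming no tied ranks,\n",
    "\n",
    "$$\\rho = 1 - \\frac{6 \\sum_{i=1}^{n} d_i^2}{n (n^2 - 1)}$$\n",
    "\n",
    "A value of 1 means the model orders the pairs exactly as the human ratings do, and 0 means there is no monotonic relationship. When ties are present, $\\rho$ is instead computed as the Pearson correlation of the (average) ranks."
   ]
  },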
  {
   "cell_type": "code",
   "execution_count": 168,
   "metadata": {},
   "outputs": [],
   "source": [
    "entailment_results = OrderedDict()\n",
    "eval_instance = LexicalEntailmentEvaluation(hyperlex_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "for implementation, models in sorted(model_files.items()):\n",
    "    for model_name, files in models.items():\n",
    "        if model_name in entailment_results:\n",
    "            continue\n",
    "        entailment_results[model_name] = OrderedDict()\n",
    "        entailment_results[model_name]['spearman'] = {}\n",
    "        for model_size, model_file in files.items():\n",
    "            print('Evaluating model %s of size %d' % (model_name, model_size))\n",
    "            embedding = load_model(implementation, model_file)\n",
    "            entailment_results[model_name]['spearman'][model_size] = eval_instance.evaluate_spearman(embedding)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 170,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "    <tr>\n",
       "        <th colspan=\"1\" style=\"text-align:left\">Lexical Entailment (HyperLex)</th>\n",
       "        <th colspan=\"1\"></th>\n",
       "        <th colspan=\"6\" style=\"text-align:center\"> Dimensions</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <th>Model Description</th>\n",
       "        <th>Metric</th>\n",
       "        <th>5</th>\n",
       "        <th>10</th>\n",
       "        <th>20</th>\n",
       "        <th>50</th>\n",
       "        <th>100</th>\n",
       "        <th>200</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>C++: burn_in=0, epochs=200, eps=1e-06, neg=20, threads=8</td>\n",
       "        <td>spearman<br></td>\n",
       "        <td>0.45</td>\n",
       "        <td>0.46</td>\n",
       "        <td>0.45</td>\n",
       "        <td>0.45</td>\n",
       "        <td>0.45</td>\n",
       "        <td>0.46</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>C++: burn_in=0, epochs=50, eps=1e-06, neg=10, threads=8</td>\n",
       "        <td>spearman<br></td>\n",
       "        <td>0.42</td>\n",
       "        <td>0.41</td>\n",
       "        <td>0.43</td>\n",
       "        <td>0.42</td>\n",
       "        <td>0.43</td>\n",
       "        <td>0.43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>C++: burn_in=0, epochs=50, eps=1e-06, neg=20, threads=8</td>\n",
       "        <td>spearman<br></td>\n",
       "        <td>0.44</td>\n",
       "        <td>0.43</td>\n",
       "        <td>0.47</td>\n",
       "        <td>0.44</td>\n",
       "        <td>0.45</td>\n",
       "        <td>0.44</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>C++: burn_in=10, epochs=50, eps=1e-06, neg=20, threads=8</td>\n",
       "        <td>spearman<br></td>\n",
       "        <td>0.43</td>\n",
       "        <td>0.42</td>\n",
       "        <td>0.44</td>\n",
       "        <td>0.44</td>\n",
       "        <td>0.44</td>\n",
       "        <td>0.45</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>Gensim: batch_size=10, burn_in=10, epochs=50, neg=20, reg=0.0</td>\n",
       "        <td>spearman<br></td>\n",
       "        <td>0.45</td>\n",
       "        <td>0.46</td>\n",
       "        <td>0.45</td>\n",
       "        <td>0.46</td>\n",
       "        <td>0.45</td>\n",
       "        <td>0.46</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>Gensim: batch_size=10, burn_in=0, epochs=50, neg=20, reg=0.0</td>\n",
       "        <td>spearman<br></td>\n",
       "        <td>0.47</td>\n",
       "        <td>0.45</td>\n",
       "        <td>0.47</td>\n",
       "        <td>0.47</td>\n",
       "        <td>0.48</td>\n",
       "        <td>0.47</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>Gensim: batch_size=10, burn_in=0, epochs=50, neg=10, reg=0.0</td>\n",
       "        <td>spearman<br></td>\n",
       "        <td>0.46</td>\n",
       "        <td>0.46</td>\n",
       "        <td>0.45</td>\n",
       "        <td>0.47</td>\n",
       "        <td>0.47</td>\n",
       "        <td>0.48</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>Gensim: batch_size=50, burn_in=0, epochs=50, neg=20, reg=0.0</td>\n",
       "        <td>spearman<br></td>\n",
       "        <td>0.46</td>\n",
       "        <td>0.46</td>\n",
       "        <td>0.47</td>\n",
       "        <td>0.47</td>\n",
       "        <td>0.48</td>\n",
       "        <td>0.47</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>Gensim: batch_size=10, burn_in=10, epochs=200, neg=10, reg=1</td>\n",
       "        <td>spearman<br></td>\n",
       "        <td>0.52</td>\n",
       "        <td>0.51</td>\n",
       "        <td>0.51</td>\n",
       "        <td>0.51</td>\n",
       "        <td>0.52</td>\n",
       "        <td>0.51</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "        <td>Numpy: epochs=50, neg=20</td>\n",
       "        <td>spearman<br></td>\n",
       "        <td>0.15</td>\n",
       "        <td>0.19</td>\n",
       "        <td>0.20</td>\n",
       "        <td>0.20</td>\n",
       "        <td>0.24</td>\n",
       "        <td>0.26</td>\n",
       "    </tr>\n",
       "</table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display_results('Lexical Entailment (HyperLex)', entailment_results)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Results from paper (for Poincaré Embeddings, as well as other embeddings from previous papers) - \n",
    "![LE Results](https://raw.githubusercontent.com/RaRe-Technologies/gensim/poincare_model_keyedvectors/docs/notebooks/poincare/entailment_paper.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some observations - \n",
    "1. We achieve a maximum Spearman score of 0.48, fairly close to the Spearman score of 0.512 reported in the paper.\n",
    "2. The best results are obtained with 20 negative examples, a batch size of 10, and no burn-in; however, the differences are too small to draw a meaningful conclusion.\n",
    "\n",
    "However, there are a few ambiguities and caveats - \n",
    "1. The paper does not mention which hyperparameters and model size were used for the above-mentioned result. Hence, it is possible that the results were achieved with a significantly smaller model than the one we use, which would imply that our implementation still has some way to go.\n",
    "2. The same word can have multiple nodes in the WordNet dataset for different senses of the word, and it is unclear from the paper how to decide which node to pick. For the above results, we have gone with the sane default of picking the particular sense that has the maximum similarity score with the target word.\n",
    "3. Certain words in the HyperLex dataset seem to be absent from the WordNet data, which the paper does not mention. Pairs containing missing words have been omitted from the evaluation (182 of 2163 pairs).\n"
   ]
  },
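  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The sense selection and missing-word handling described in the caveats above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the evaluation code used for the results above: `best_sense_score`, `spearman_no_ties`, `evaluate`, the `score_fn` callback and all toy labels and scores are hypothetical. Taking the maximum over every pair of candidate senses is only one possible reading of the default, and the rank correlation assumes there are no tied values.\n",
    "\n",
    "```python\n",
    "from itertools import product\n",
    "\n",
    "\n",
    "def best_sense_score(senses_a, senses_b, score_fn):\n",
    "    # Score every pair of candidate senses and keep the maximum.\n",
    "    # Return None if either word has no node, so the pair can be skipped.\n",
    "    if not senses_a or not senses_b:\n",
    "        return None\n",
    "    return max(score_fn(a, b) for a, b in product(senses_a, senses_b))\n",
    "\n",
    "\n",
    "def ranks(values):\n",
    "    # Rank positions starting at 1; this toy version assumes no tied values.\n",
    "    order = sorted(range(len(values)), key=lambda i: values[i])\n",
    "    result = [0] * len(values)\n",
    "    for rank, index in enumerate(order, start=1):\n",
    "        result[index] = rank\n",
    "    return result\n",
    "\n",
    "\n",
    "def spearman_no_ties(xs, ys):\n",
    "    # Classic formula 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid without ties.\n",
    "    n = len(xs)\n",
    "    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))\n",
    "    return 1 - 6 * d_squared / (n * (n ** 2 - 1))\n",
    "\n",
    "\n",
    "def evaluate(pairs, senses, score_fn):\n",
    "    # pairs: (word_a, word_b, gold_rating) triples.\n",
    "    # senses: dict mapping a word to its list of node labels.\n",
    "    gold, predicted, skipped = [], [], 0\n",
    "    for word_a, word_b, rating in pairs:\n",
    "        score = best_sense_score(senses.get(word_a, []), senses.get(word_b, []), score_fn)\n",
    "        if score is None:\n",
    "            skipped += 1  # a word is missing from the vocabulary\n",
    "            continue\n",
    "        gold.append(rating)\n",
    "        predicted.append(score)\n",
    "    return spearman_no_ties(gold, predicted), skipped\n",
    "\n",
    "\n",
    "# Made-up toy data, purely for illustration.\n",
    "toy_senses = {'a': ['a.1', 'a.2'], 'b': ['b.1'], 'c': ['c.1'], 'd': ['d.1']}\n",
    "toy_node_scores = {\n",
    "    ('a.1', 'b.1'): 0.1, ('a.2', 'b.1'): 0.9, ('b.1', 'c.1'): 0.5,\n",
    "    ('c.1', 'd.1'): 0.3, ('a.1', 'd.1'): 0.2, ('a.2', 'd.1'): 0.7,\n",
    "}\n",
    "toy_pairs = [('a', 'b', 9.0), ('b', 'c', 4.0), ('c', 'd', 5.0), ('a', 'd', 7.5), ('a', 'zzz', 1.0)]\n",
    "\n",
    "rho, skipped = evaluate(toy_pairs, toy_senses, lambda x, y: toy_node_scores.get((x, y), 0.0))\n",
    "print(rho, skipped)\n",
    "```\n",
    "\n",
    "In this toy run, the pair containing the unknown word `zzz` is skipped, which mirrors how pairs with missing words are left out of the evaluation above."
   ]
  },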
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.4 Link Prediction on Collaboration Networks\n",
    "\n",
    "The paper also describes a variant of the Poincaré model to learn embeddings of nodes in a symmetric graph, unlike the WordNet noun hierarchy, which is directed and asymmetric. The datasets used in the paper for this model are scientific collaboration networks, in which the nodes are researchers and an edge represents that the two researchers have co-authored a paper.\n",
    "\n",
    "This variant has not been implemented yet, and is therefore not a part of our experiments."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Next Steps"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. The model can be investigated further to understand why it doesn't produce results as good as those in the paper. This might be due to training details not present in the paper, or due to us incorrectly interpreting some ambiguous parts of the paper. We have not been able to clarify all such ambiguities in communication with the authors.\n",
    "2. The training process can be optimized further - with a model size of 50 dimensions and a dataset with ~700k relations and ~80k nodes, the Gensim implementation takes around 45 seconds to complete an epoch (~15k relations per second), whereas the open-source C++ implementation takes around 1/6th of the time (~95k relations per second).\n",
    "3. The variant of the model mentioned in the paper for symmetric graphs can be implemented and evaluated on the scientific collaboration datasets described earlier in the report."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
